[SPARK-25102][Spark Core] Write Spark version information to Parquet … #22255

space-d-n wants to merge 2 commits into apache:master from space-d-n:spark-version-info-to-parquet-footer
Conversation

@dbtsai Hello, I'm sorry for pinging you directly, but for some reason Jenkins did not generate the message "Can one of the admins verify this patch?". I just saw that you've reviewed some other PRs. This is my first PR, so maybe I did something wrong while creating it. I would be grateful for a review or any other advice.

Is there any other project writing this into the footer? Are there tests on reading this back?
.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala

I would also rather see the justification for this change written down, for instance linking the usage of this name on the Parquet side, potential usage, etc.
…file footer (Added test on reading writer.model.name from footer metadata)

Hello, @dbtsai, @HyukjinKwon. I added a test on reading writer.model.name to the PR. The justification for this change is below.

Hi @rdblue, is this roughly the right thing to do here in Spark?

I don't think this fits the intent of the model name. The model name is intended to encode what the data model was that was written to Parquet. I can write Avro records to a Parquet file, for example, and we identify that using "avro" (and this could be done in Spark). That could be used if we need to interpret the data differently from a model, but it probably shouldn't include a version of that data model. The data model doesn't change with a version bump, so I think these should be logically separate.

It would be reasonable to add a "spark.version" property with this information. Other data models add properties to the file's key-value metadata for their own use. Avro adds its schema, for example.
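The separation described above can be illustrated with plain data (a minimal sketch, not the actual Parquet API; the `spark.version` key is the one suggested in this comment, and the version value is illustrative):

```scala
// Minimal sketch of the separation described above (not the Parquet API).
// The writer model name identifies only the data model that produced the file.
val writerModelName: String = "spark"

// Free-form footer key-value metadata, where each data model can add its own
// properties (Avro adds its schema there, for example).
val keyValueMetadata: Map[String, String] = Map("spark.version" -> "2.4.0")

// The version lives in the key-value metadata, not in the model name.
assert(!writerModelName.contains("2.4.0"))
assert(keyValueMetadata("spark.version") == "2.4.0")
```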

I get your idea now. Apparently I was a little confused by the descriptions of the tickets.

@npoberezkin, Parquet already supports custom key-value metadata in the file footer. The Spark version would go there.

Hi, @npoberezkin. Thank you for your first contribution. Could you update your PR to use custom key-value metadata according to @rdblue's advice above? Also, please use tag

@rdblue Can we use `created_by`?

@dongjoon-hyun Do you want to take this over?

Also cc @hvanhovell

Sure, @gatorsmile.

BTW, @rdblue recommended `key_value_metadata`. Are we going to

Currently, we put the metadata like the following. For the Hive table, it looks like the following. So, I prefer to add

That will go like the following.

It seems to cause some inconsistency if we choose one of
I'll ignore the consistency of (2) for backward compatibility.

Just to confirm it.

That is the value used by the Parquet-MR library. We had better not touch it. The Parquet-MR reader can behave differently based on that version, to work around bugs in some older Parquet writers.
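For context, `created_by` carries a writer identification string like the one visible in the `parquet-tools meta` output later in this thread, and readers parse it to detect old writer versions. A self-contained sketch of such parsing (the regex here is only illustrative, not parquet-mr's actual `VersionParser`):

```scala
// Example created_by string as parquet-mr writes it into the footer
// (taken from the parquet-tools output in this PR description).
val createdBy =
  "parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)"

// Illustrative parse; parquet-mr's real VersionParser is more lenient.
val CreatedBy = """(\S+) version (\S+) \(build (\S+)\)""".r

val (app, version) = createdBy match {
  case CreatedBy(a, v, _) => (a, v)
}

assert(app == "parquet-mr")
assert(version == "1.10.0")
```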

Hi, all. A new PR has been made. Please move to #22932 for further discussion.
## What changes were proposed in this pull request?
Currently, Spark writes the Spark version number into Hive table properties with `spark.sql.create.version`.
```
parameters:{
spark.sql.sources.schema.part.0={
"type":"struct",
"fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
},
transient_lastDdlTime=1541142761,
spark.sql.sources.schema.numParts=1,
spark.sql.create.version=2.4.0
}
```
This PR aims to write the Spark version to ORC/Parquet file metadata with `org.apache.spark.sql.create.version`, because we already use the `org.apache.` prefix in Parquet metadata. It differs from the Hive table property key `spark.sql.create.version`, but it seems we cannot change the Hive table property, for backward compatibility.
After this PR, ORC and Parquet file generated by Spark will have the following metadata.
**ORC (`native` and `hive` implementations)**
```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
org.apache.spark.sql.create.version=3.0.0
```
**PARQUET**
```
$ parquet-tools meta /tmp/p
...
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra: org.apache.spark.sql.create.version = 3.0.0
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```
## How was this patch tested?
Pass the Jenkins with newly added test cases.
This closes apache#22255.
Closes apache#22932 from dongjoon-hyun/SPARK-25102.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Overrode the `getName` method from `org.apache.parquet.hadoop.api.WriteSupport` in `org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport` so that it returns the Spark version (to be written to the Parquet file footer later).
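The shape of that override can be sketched with a simplified stand-in for the parquet-mr class (a mock, not the real `WriteSupport`, and `SPARK_VERSION_SHORT` is a hypothetical stand-in for Spark's version constant; this is the approach later superseded by #22932):

```scala
// Simplified mock of org.apache.parquet.hadoop.api.WriteSupport, only to
// show the shape of the override; the real abstract class lives in parquet-mr.
abstract class MockWriteSupport {
  // parquet-mr records this value as writer.model.name in the footer.
  def getName: String = null
}

// Hypothetical stand-in for Spark's version constant.
val SPARK_VERSION_SHORT = "2.4.0"

class ParquetWriteSupportSketch extends MockWriteSupport {
  override def getName: String = s"spark version $SPARK_VERSION_SHORT"
}

assert(new ParquetWriteSupportSketch().getName == "spark version 2.4.0")
```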